Modeling Transcriptome Based on Transcript-Sampling Data
Authors
Abstract
BACKGROUND Newly developed multiplex sequencing technology has brought transcriptome sequencing to an unprecedented depth. Millions of transcript tags can now be acquired in a single experiment through parallelization. This significant increase in throughput and reduction in cost requires us to address some fundamental questions. How many transcript tags do we have to sequence for a given transcriptome? How can we estimate the total number of unique transcripts for different cell types (transcriptome diversity) and the distribution of their copy numbers (transcriptome dynamics)? What is the probability that a transcript with a given expression level will be detected at a certain sampling depth?

METHODOLOGY/PRINCIPAL FINDINGS We developed a statistical model to evaluate these parameters based on transcriptome-sampling data. Three mixture models were explored for their potential to model the sampling frequencies. We demonstrated that the relative abundances of all transcripts in a transcriptome follow the generalized inverse Gaussian distribution. The widely known beta and gamma distributions failed to capture the distinctive characteristics of the relative abundance distribution, i.e., a strong skew toward zero and a long tail. An estimator of transcriptome diversity and an analytical form of the sampling growth curve were proposed in a coherent framework. Experimental data fitted this model very well, and Monte Carlo simulations based on this model replicated sampling experiments with remarkable precision.

CONCLUSIONS Taking the human embryonic stem cell as a prototype, we demonstrated that sequencing tens of thousands of transcript tags in an ordinary EST/SAGE experiment was far from sufficient. In order to fully characterize a human transcriptome, millions of transcript tags had to be sequenced. This model lays a statistical basis for transcriptome-sampling experiments and can in principle be applied to any sampling-based data.
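The detection question raised in the abstract can be illustrated with a minimal sketch. It assumes a deliberately simplified setting that is not the paper's mixture model: each sequenced tag is an independent draw, and a transcript with relative abundance p is picked with probability p on every draw. Under that assumption the chance of seeing the transcript at least once in n tags is 1 - (1 - p)^n. The function names `detection_probability` and `required_depth` are hypothetical and chosen only for this illustration.

```python
import math


def detection_probability(relative_abundance: float, depth: int) -> float:
    """Chance of sampling a transcript at least once in `depth` tags.

    Assumption (not from the paper): tags are independent draws and the
    transcript is hit with probability `relative_abundance` on each draw.
    """
    if not 0.0 <= relative_abundance <= 1.0:
        raise ValueError("relative_abundance must lie in [0, 1]")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    if depth == 0 or relative_abundance == 0.0:
        return 0.0
    if relative_abundance == 1.0:
        return 1.0
    # 1 - (1 - p)^n, written with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(depth * math.log1p(-relative_abundance))


def required_depth(relative_abundance: float, target_probability: float) -> int:
    """Smallest number of tags giving at least `target_probability` of detection.

    Same independence assumption as `detection_probability`.
    """
    if not 0.0 < relative_abundance < 1.0:
        raise ValueError("relative_abundance must lie strictly between 0 and 1")
    if not 0.0 < target_probability < 1.0:
        raise ValueError("target_probability must lie strictly between 0 and 1")
    # Solve 1 - (1 - p)^n >= target for n and round up to a whole tag count.
    ratio = math.log1p(-target_probability) / math.log1p(-relative_abundance)
    return math.ceil(ratio)


if __name__ == "__main__":
    rare = 1e-6  # illustrative abundance: one tag in a million
    for depth in (10_000, 100_000, 1_000_000, 10_000_000):
        print(depth, round(detection_probability(rare, depth), 4))
    print("tags needed for 95% detection:", required_depth(rare, 0.95))
```

Under this simplified assumption, a transcript making up one tag in a million needs roughly three million sequenced tags to be detected with 95% probability, which is in line with the abstract's conclusion that tens of thousands of tags are far from sufficient. The paper's own estimates rest on its fitted abundance distribution and may differ from this sketch.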
Journal title:
- PLoS ONE
Volume 3, issue
Pages -
Publication date 2008